[Misc][Main2Main] Upgrade vLLM to 0429 (DSV4/v0.20.0) #8856
Conversation
Signed-off-by: wxsIcey <1790571317@qq.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request updates the vLLM integration to a newer version, necessitating significant refactoring of the Mixture-of-Experts (MoE) implementation. The changes streamline the MoE layer architecture by consolidating shared expert logic and updating internal APIs to match the upstream vLLM structure. Additionally, minor enhancements were made to the rejection sampler and worker compilation reporting to support new functionality and interface requirements.
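To illustrate the "shared expert consistency validation" mentioned in the summary, here is a minimal sketch; the helper name, tensor arguments, and tolerance are assumptions for illustration, not the actual vllm-ascend implementation:

```python
import torch


def check_shared_expert_consistency(fused_out: torch.Tensor,
                                    reference_out: torch.Tensor,
                                    atol: float = 1e-3) -> None:
    """Hypothetical helper: verify that the shared-expert output produced by
    the merged AscendFusedMoE path matches a separately computed reference."""
    if not torch.allclose(fused_out, reference_out, atol=atol):
        max_diff = (fused_out - reference_out).abs().max().item()
        raise ValueError(
            f"shared expert outputs diverge (max abs diff {max_diff:.3e})")
```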
Code Review
Suggested PR Title:

[Ops][Feature] Refactor Ascend Fused MoE and update worker compilation times

Suggested PR Summary:

### What this PR does / why we need it?
This PR refactors the Ascend Fused MoE implementation by merging AscendSharedFusedMoE into AscendFusedMoE and updating the runner to inherit from MoERunner. It introduces shared expert consistency validation and multi-stream overlap configurations. Additionally, it updates the rejection sampler with synthetic mode parameters and modifies the worker to return structured CompilationTimes. Feedback includes fixing a function call bug where `enable_sp` was used as a boolean, removing placeholder comments, and cleaning up unreachable code in the worker.

### Does this PR introduce _any_ user-facing change?
No significant user-facing API changes, though internal MoE logic and worker return types for compilation are updated.

### How was this patch tested?
The PR includes a consistency check for shared expert computation.

Fixes #
…n Ascend (vLLM PR #40860)

vLLM PR #40860 ([Feat] DeepSeek V4 Rebased) introduced `resolve_kv_cache_block_sizes()` into engine/core.py and added a restriction that hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1), raising:

`ValueError: Hybrid KV cache groups with multiple block sizes do not support context parallelism (dcp_world_size/pcp_world_size > 1).`

This restriction is correct for CUDA (the CUDA MLA implementation cannot combine hybrid KV with CP), but Ascend has dedicated CP backends for MLA (mla_cp.py) and SFA (sfa_cp.py) that handle this combination.

The fix patches `resolve_kv_cache_block_sizes()` to skip the ValueError for multiple-groups + CP on Ascend and instead compute `scheduler_block_size` as `lcm(group_block_sizes) * dcp * pcp` for proper alignment; a sketch follows the diff excerpt below.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
```python
) -> tuple[int, int]:
    """Ascend-compatible resolve_kv_cache_block_sizes.

    vLLM PR #40860 added a restriction that hybrid KV cache groups with
```
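A minimal sketch of the block-size resolution described above, computing the scheduler block size as `lcm(group_block_sizes) * dcp * pcp`; the function and parameter names are simplified assumptions, not the exact vllm-ascend patch:

```python
import math
from functools import reduce


def resolve_scheduler_block_size(group_block_sizes: list[int],
                                 dcp_world_size: int,
                                 pcp_world_size: int) -> int:
    """Sketch: align the scheduler block size across hybrid KV cache groups
    under context parallelism instead of raising a ValueError."""
    # lcm of all per-group block sizes so every group's blocks tile evenly.
    lcm_block = reduce(math.lcm, group_block_sizes, 1)
    # Scale by the context-parallel world sizes for proper alignment.
    return lcm_block * dcp_world_size * pcp_world_size
```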
This pull request has conflicts; please resolve those before we can evaluate the pull request.
### What this PR does / why we need it?

Based on #8856. Sync to vLLM `4d51588e2381018348f1022dfa3a7698899805b7`.

Fix:

---

- MoE refactor @wxsIcey, introduced by vllm-project/vllm#35782, vllm-project/vllm#35949, vllm-project/vllm#40560, vllm-project/vllm#40671.
- `TypeError: rejection_sample() got an unexpected keyword argument 'synthetic_mode'` -> Add `synthetic_mode` and `synthetic_conditional_rates` params to the Ascend `rejection_sample()` (see the sketch after this description).

---

| # | Error | Category | Upstream Commit | Affected vllm-ascend Path | Fix |
| :- | :--- | :--- | :--- | :--- | :--- |
| 1 | `encoder_compilation_time` AttributeError | Code Bug | `c08f3b2a6` ([#39240](vllm-project/vllm#39240)) | `worker/worker.py:567` | `getattr` fallback |
| 2 | `AscendRMSNormGated activation` TypeError | Code Bug | `893611813` ([#40245](vllm-project/vllm#40245)) | `ops/layernorm.py:160`, `_310p/ops/layernorm.py:43` | Accept `activation` kwarg |
| 3 | `AscendFusedMoEMethod.apply topk_weights` TypeError | Code Bug | many (e.g., `5e584ce9e` ([#35782](vllm-project/vllm#35782)), `809d83c2d` ([#40560](vllm-project/vllm#40560)), `4d51588e2` ([#40860](vllm-project/vllm#40860))) | `ops/fused_moe/fused_moe.py:107` | Major refactor — follow-up PR |
| 4 | `_all_lora_classes` is tuple | Code Bug | `a250f1bd5` ([#35077](vllm-project/vllm#35077)) | `lora/utils.py:188` | Rebuild tuple instead of `.add()` |
| 5 | `ProfilingChunkScheduler hash_block_size` TypeError | Code Bug | `7b1bc0a3e` ([#40946](vllm-project/vllm#40946)) | `core/scheduler_profiling_chunk.py:57` | Forward new kwarg |
| 6 | `_moe_C.topk_softmax` AttributeError | Code Bug | MoE router refactor | router dispatch override needed | Provide `torch_npu` topk-softmax (with Issue 4) |
| 7 | global experts shape mismatch | Code Bug | follow-on of Issue 4 | `quantization/methods/w8a8_dynamic.py:198` | Resolve once Issue 4 is fixed |

- vLLM main: vllm-project/vllm@d886c26

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
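As an illustration of the `rejection_sample()` bullet above, here is a minimal sketch of extending the Ascend signature to accept the new upstream keyword arguments; the positional parameters, defaults, and error handling are assumptions, not the actual vllm-ascend code:

```python
from typing import Optional

import torch


def rejection_sample(
    draft_token_ids: torch.Tensor,
    target_probs: torch.Tensor,
    *,
    synthetic_mode: bool = False,
    synthetic_conditional_rates: Optional[torch.Tensor] = None,
    **kwargs,
) -> torch.Tensor:
    """Sketch only: accept the new upstream keyword arguments so the Ascend
    path no longer raises TypeError; the real sampling logic is elided."""
    if synthetic_mode:
        # Hypothetical handling; the upstream semantics are not reproduced here.
        raise NotImplementedError(
            "synthetic_mode is not yet supported by the Ascend sampler")
    ...  # existing Ascend rejection-sampling logic would run here
```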
### What this PR does / why we need it?
Cherry-pick #8841 and then continue upgrading vLLM.
| Error | Upstream Commit | Affected vllm-ascend Path | Fix |
| :--- | :--- | :--- | :--- |
| `encoder_compilation_time` AttributeError | `c08f3b2a6` (#39240) | `worker/worker.py:567` | `getattr` fallback |
| `AscendRMSNormGated activation` TypeError | `893611813` (#40245) | `ops/layernorm.py:160`, `_310p/ops/layernorm.py:43` | Accept `activation` kwarg |
| `EagleCudaGraphManager` ImportError | `4c7c69b4e` (#40410) | `worker/v2/spec_decode/eagle/{speculator,aclgraph}.py` | |
| `AscendFusedMoEMethod.apply topk_weights` TypeError | `5e584ce9e` (#35782), `809d83c2d` (#40560), `4d51588e2` (#40860) | `ops/fused_moe/fused_moe.py:107` | Major refactor — follow-up PR |
| `_all_lora_classes` is tuple | `a250f1bd5` (#35077) | `lora/utils.py:188` | Rebuild tuple instead of `.add()` |
| `ProfilingChunkScheduler hash_block_size` TypeError | `7b1bc0a3e` (#40946) | `core/scheduler_profiling_chunk.py:57` | Forward new kwarg |
| `_moe_C.topk_softmax` AttributeError | MoE router refactor | router dispatch override needed | Provide `torch_npu` topk-softmax (with Issue 4) |
| global experts shape mismatch | follow-on of Issue 4 | `quantization/methods/w8a8_dynamic.py:198` | Resolve once Issue 4 is fixed |
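As an illustration of the first row's "`getattr` fallback", here is a minimal sketch; the function and argument names are assumptions for illustration, not the exact worker.py code:

```python
def get_encoder_compilation_time(compilation_times: object) -> float:
    """Sketch: tolerate vLLM builds where the compilation-times object no
    longer exposes `encoder_compilation_time`; fall back to 0.0 instead of
    raising AttributeError."""
    return float(getattr(compilation_times, "encoder_compilation_time", 0.0))
```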
### Does this PR introduce any user-facing change?

### How was this patch tested?